Abstract. Transductive learning considers situations when a learner 
observes m labelled training points and u unlabelled test points with 
the final goal of giving correct answers for the test points. This paper 
introduces a new complexity measure for transductive learning called 
Permutational Rademacher Complexity (PRC) and studies its properties. 
A novel symmetrization inequality is proved, which shows that PRC 
provides a tighter control over expected suprema of empirical processes 
compared to what happens in the standard i.i.d. setting. A number of 
comparison results are also provided, which show the relation between 
PRC and other popular complexity measures used in statistical learning 
theory, including Rademacher complexity and Transductive Rademacher 
Complexity (TRC). We argue that PRC is a more suitable complexity 
measure for transductive learning. Finally, these results are combined 
with a standard concentration argument to provide novel data-dependent 
risk bounds for transductive learning. 

Keywords: Transductive Learning, Rademacher Complexity, Statistical
Learning Theory, Empirical Processes, Concentration Inequalities


1 Introduction 

Rademacher complexities ([14], [2]) play an important role in the widely used
concentration-based approach to statistical learning theory [4], which is closely
related to the analysis of empirical processes [21]. They measure the complexity of
function classes and provide data-dependent risk bounds in the standard i.i.d. 
framework of inductive learning, thanks to symmetrization and concentration 
inequalities. Recently, a number of attempts were made to apply this machinery 
also to the transductive learning setting [22]. In particular, the authors of [10] 
introduced a notion of transductive Rademacher complexity and provided an 
extensive study of its properties, as well as general transductive risk bounds 
based on this new complexity measure. 
In transductive learning, a learner observes m labelled training points
and u unlabelled test points. The goal is to give correct answers on the test
points. Transductive learning naturally appears in many modern large-scale
applications, including text mining, recommender systems, and computer vision,
where the objects to be classified are often available beforehand. There are two
different settings of transductive learning, defined by V. Vapnik in his book
[22, Chap. 8]. The first one assumes that all the objects from the training and test
sets are generated i.i.d. from an unknown distribution P. The second one is
distribution-free, and it assumes that the training and test sets are realized by a
uniform random partition of a fixed and finite general population of cardinality
N := m + u into two disjoint subsets of cardinalities m and u; moreover,
no assumptions are made regarding the underlying source of this general
population. The second setting has gained much attention (see [22], [5], [7], [10], and
[20], among others), probably because any upper risk bound for this setting directly
implies a risk bound for the first setting as well [22, Theorem 8.1]. In essence, the
second setting studies uniform deviations of risks computed on two disjoint
finite samples. Following Vapnik's discussion in [22, p. 458], we would also like to
emphasize that the second setting of transductive learning naturally appears as
a middle step in proofs of the standard inductive risk bounds, as a result of
symmetrization or the so-called double-sample trick. In this way, better transductive
risk bounds also translate into better inductive ones.

An important difference between the two settings discussed above lies in the
fact that the m elements of the training set in the second setting are
interdependent, because they are sampled uniformly without replacement from the
general population. As a result, the standard techniques developed for inductive
learning, including concentration and Rademacher complexities mentioned
in the beginning, cannot be applied in this setting, since they are heavily based
on the i.i.d. assumption. Therefore, it is important to study empirical processes
in the setting of sampling without replacement.

Previous work. A large step in this direction was made in [10], where the
authors presented a version of McDiarmid's bounded difference inequality [5]
for sampling without replacement together with the Transductive Rademacher
Complexity (TRC). As a main application the authors derived an upper bound
on the binary test error of a transductive learning algorithm in terms of TRC.
However, the analysis of [10] has a number of shortcomings. Most importantly,
TRC depends on the unknown labels of the test set. In order to obtain
computable risk bounds, the authors resorted to the contraction inequality [15],
which is known to be a loose step [17], since it destroys any dependence on the
labels.

Another line of work was presented in [20], where variants of Talagrand's
concentration inequality were derived for the setting of sampling without
replacement. These inequalities were then applied to achieve transductive risk bounds
with fast rates of convergence $o(m^{-1/2})$, following a localized approach [1]. In
contrast, in this work we consider only the worst-case analysis based on
global complexity measures. An analysis under additional assumptions on the
problem at hand, including Mammen-Tsybakov type low noise conditions [4], is
an interesting open question and is left for future work.

Summary of our results. This paper continues the analysis of empirical
processes indexed by arbitrary classes of uniformly bounded functions in the
setting of sampling without replacement, initiated by [10]. We introduce a new
complexity measure called permutational Rademacher complexity (PRC) and
argue that it captures the nature of this setting very well. Due to space limitations
we present the analysis of PRC only for the special case when the training and
test sets have the same size m = u, which is nonetheless sufficiently illustrative¹.

We prove a novel symmetrization inequality (Theorem 2), which shows that
the expected PRC and the expected suprema of empirical processes when sampling
without replacement are equivalent up to multiplicative constants. Quite
remarkably, the new upper and lower bounds (the latter is often called the
desymmetrization inequality) both hold without any additive terms when m = u, in
contrast to the standard i.i.d. setting, where an additive term of order $O(m^{-1/2})$
is unavoidable in the lower bound. For TRC even the upper symmetrization
inequality [10, Lemma 4] includes an additive term of the order $O(m^{-1/2})$ and no
desymmetrization inequality is known. This suggests that PRC may be a more
suitable complexity measure for transductive learning. We would also like to
note that the proof of our new symmetrization inequality is surprisingly simple,
compared to the one presented in [10].

Next we compare PRC with other popular complexity measures used in
statistical learning theory. In particular, we provide achievable upper and lower
bounds, relating PRC to the conditional Rademacher complexity (Theorem 3).
These bounds show that the PRC is upper and lower bounded by the conditional
Rademacher complexity up to additive terms of orders $o(m^{-1/2})$ and $O(m^{-1/2})$
respectively, which are achievable (Lemma 1). In addition to this, Theorem 3
also significantly improves bounds on the complexity measure called maximum
discrepancy presented in [2, Lemma 3]. We also provide a comparison between
expected PRC and TRC (Corollary 1), which shows that their values are close
up to small multiplicative constants and additive terms of order $O(m^{-1/2})$.

Finally, we apply these results to obtain a new computable data-dependent
risk bound for transductive learning based on the PRC (Theorem 5), which holds
for any bounded loss function. We conclude by discussing the advantages of the
new risk bound over the previously best known one of [10].

2 Notations 

We will use calligraphic symbols to denote sets, with subscripts indicating their
cardinalities: $\mathrm{card}(\mathcal{Z}_m) = m$. For any function $f$ we will denote its average value
computed on a finite set $\mathcal{S}$ by $\bar{f}(\mathcal{S})$. In what follows we will consider an arbitrary
space $\mathcal{Z}$ (for instance, a space of input-output pairs) and a class $\mathcal{F}$ of functions
(for instance, loss functions) mapping $\mathcal{Z}$ to $\mathbb{R}$. Most of the proofs are deferred
to the last section for improved readability.

¹ All the results presented in this paper are also available for the general $m \neq u$ case,
but we defer them to a future extended version of this paper.

Arguably, one of the most popular complexity measures used in statistical
learning theory is the Rademacher complexity ([15], [14], [2]):

Definition 1 (Conditional Rademacher complexity). Fix any subset
$\mathcal{Z}_m = \{Z_1, \dots, Z_m\} \subset \mathcal{Z}$. The following random quantity is commonly known as a
conditional Rademacher complexity:

$$R_m(\mathcal{F}, \mathcal{Z}_m) = \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(Z_i)\right],$$

where $\epsilon = \{\epsilon_i\}_{i=1}^{m}$ are i.i.d. Rademacher signs, taking values ±1 with probabilities
1/2. When the set $\mathcal{Z}_m$ is clear from the context we will simply write $R_m(\mathcal{F})$.

As discussed in the introduction, Rademacher complexities play an important
role in the analysis of empirical processes and statistical learning theory.
However, this measure of complexity was devised mainly for the i.i.d. setting, which
is different from our setting of sampling without replacement. The following
complexity measure was introduced in [10] to overcome this issue:

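Definition 2 (Transductive Rademacher complexity). Fix any set
$\mathcal{Z}_N = \{Z_1, \dots, Z_N\} \subset \mathcal{Z}$, positive integers m, u such that N = m + u, and $p \in [0, 1/2]$.
The following quantity is called Transductive Rademacher complexity (TRC):

$$R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, p) = \left(\frac{1}{m} + \frac{1}{u}\right) \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{N} \sigma_i f(Z_i)\right],$$

where $\sigma = \{\sigma_i\}_{i=1}^{m+u}$ are i.i.d. random variables taking each of the values +1 and −1
with probability p and the value 0 with probability 1 − 2p.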

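We summarize the importance of these two complexity measures in the analysis
of empirical processes when sampling without replacement in the following result:

Theorem 1. Fix an N-element subset $\mathcal{Z}_N \subset \mathcal{Z}$ and let m < N elements of
$\mathcal{Z}_m$ be sampled uniformly without replacement from $\mathcal{Z}_N$. Also let m elements of
$\mathcal{X}_m$ be sampled uniformly with replacement from $\mathcal{Z}_N$. Denote $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$
with $u := \mathrm{card}(\mathcal{Z}_u) = N - m$. The following upper bound in terms of the i.i.d.
Rademacher complexity was provided in [20]:

$$\mathbb{E}_{\mathcal{Z}_m} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_u) - \bar{f}(\mathcal{Z}_m)\right) \le \frac{N}{u} \cdot \mathbb{E}_{\mathcal{X}_m}\left[R_m(\mathcal{F}, \mathcal{X}_m)\right]. \qquad (1)$$

The following bound in terms of TRC was provided in [10]. Assume that functions
in $\mathcal{F}$ are uniformly bounded by B. Then for $p_0 := mu/N^2$ and $c_0 < 5.05$:

$$\mathbb{E}_{\mathcal{Z}_m} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_u) - \bar{f}(\mathcal{Z}_m)\right) \le R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, p_0) + c_0 B \, \frac{N \sqrt{\min(m, u)}}{mu}. \qquad (2)$$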
While (1) did not explicitly appear in [20], it can be immediately derived using
[20, Corollary 8] and the i.i.d. symmetrization of [13, Theorem 2.1].

Finally, we introduce our new complexity measure: 

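Definition 3 (Permutational Rademacher complexity). Let $\mathcal{Z}_m \subset \mathcal{Z}$ be
any fixed set of cardinality m. For any $n \in \{1, \dots, m - 1\}$ the following quantity
will be called a permutational Rademacher complexity (PRC):

$$Q_{m,n}(\mathcal{F}, \mathcal{Z}_m) = \mathbb{E}_{\mathcal{Z}_n}\left[\sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_k) - \bar{f}(\mathcal{Z}_n)\right)\right],$$

where $\mathcal{Z}_n$ is a random subset of $\mathcal{Z}_m$ containing n elements sampled uniformly
without replacement and $\mathcal{Z}_k := \mathcal{Z}_m \setminus \mathcal{Z}_n$. When the set $\mathcal{Z}_m$ is clear from the
context we will simply write $Q_{m,n}(\mathcal{F})$.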
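For a finite class, Definition 3 can be evaluated exactly by enumerating all n-element subsets of the sample. The following is only a minimal sketch under stated assumptions: the class is represented as a list of value rows (`values[j][i]` standing for the value of the j-th function on the i-th point), and the function name `prc` is a hypothetical helper, not notation from the paper.

```python
import itertools


def prc(values, n):
    """Exact Q_{m,n}: average over all n-subsets Z_n of sup_f (mean_f(Z_k) - mean_f(Z_n)).

    values[j][i] is assumed to hold f_j(Z_i) for a finite class (hypothetical layout).
    """
    m = len(values[0])
    k = m - n
    total = 0.0
    count = 0
    for z_n in itertools.combinations(range(m), n):
        in_n = set(z_n)
        # Supremum over the finite class of the gap between the two subset averages.
        best = max(
            sum(row[i] for i in range(m) if i not in in_n) / k
            - sum(row[i] for i in in_n) / n
            for row in values
        )
        total += best
        count += 1
    return total / count


# Two constant functions: both subset averages coincide, so the PRC vanishes.
constants = [[1.0] * 4, [0.0] * 4]
# All 0/1 vectors with exactly two ones among four points.
balanced = [
    [1.0 if i in ones else 0.0 for i in range(4)]
    for ones in itertools.combinations(range(4), 2)
]
q_constants = prc(constants, 2)
q_balanced = prc(balanced, 2)
```

The enumeration costs C(m, n) evaluations of the supremum, so this is practical only for small m; it is meant to make the definition concrete rather than to be an efficient estimator.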

The name PRC is explained by the fact that if m is even then the definitions
of $Q_{m,m/2}(\mathcal{F})$ and $R_m(\mathcal{F})$ are very similar. Indeed, the only difference is that the
expectation in the PRC is over the randomly permuted sequence containing an equal
number of "−1" and "+1", whereas in Rademacher complexity the average is
w.r.t. all the possible sequences of signs. The term "permutation complexity" has
already appeared in [16], where it was used to denote a novel complexity measure
for model selection. However, this measure was specific to the i.i.d. setting and
binary loss. Moreover, the bounds presented in [16] were of the same order as
the risk bounds based on the Rademacher complexity with worse constants in
the slack term.


3 Symmetrization and Comparison Results 


We start by showing a version of the i.i.d. symmetrization inequality
(references can be found in [21], [13]) for the setting of sampling without replacement.
It shows that the expected supremum of empirical processes in this setting is
equivalent, up to multiplicative constants, to the expected PRC.

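Theorem 2. Fix an N-element subset $\mathcal{Z}_N \subset \mathcal{Z}$ and let m < N elements of $\mathcal{Z}_m$
be sampled uniformly without replacement from $\mathcal{Z}_N$. Denote $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ with
$u := \mathrm{card}(\mathcal{Z}_u) = N - m$. If m = u and m is even then for any $n \in \{1, \dots, m - 1\}$:

$$\frac{1}{2} \mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \le \mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_u) - \bar{f}(\mathcal{Z}_m)\right)\right] \le \mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,n}(\mathcal{F}, \mathcal{Z}_m)\right].$$

The inequalities also hold if we include absolute values inside the suprema.

Proof. The proof can be found in Sect. 5.1.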
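The two-sided symmetrization inequality, (1/2) E[Q_{m,m/2}] ≤ E sup_f (mean_f(Z_u) − mean_f(Z_m)) ≤ E[Q_{m,n}], can be checked by brute force on a tiny population. The sketch below is an illustration under assumptions: the population has N = 8 points, the class is a small random table of values, and all names (`mean`, `prc`, `symmetrization_terms`, `toy_class`) are hypothetical.

```python
import itertools
import random


def mean(row, idx):
    """Average of one function (a row of values) over the index collection idx."""
    return sum(row[i] for i in idx) / len(idx)


def prc(values, points, n):
    """Exact Q_{m,n} on the index set `points`, enumerating all n-subsets."""
    gaps = []
    for z_n in itertools.combinations(points, n):
        z_k = [i for i in points if i not in z_n]
        gaps.append(max(mean(row, z_k) - mean(row, z_n) for row in values))
    return sum(gaps) / len(gaps)


def symmetrization_terms(values, m, n):
    """Return (0.5 * E[Q_{m,m/2}], E sup(mean(Z_u) - mean(Z_m)), E[Q_{m,n}]) for N = 2m."""
    population = range(2 * m)
    lower = middle = upper = 0.0
    splits = list(itertools.combinations(population, m))
    for z_m in splits:
        z_u = [i for i in population if i not in z_m]
        middle += max(mean(row, z_u) - mean(row, z_m) for row in values)
        lower += 0.5 * prc(values, z_m, m // 2)
        upper += prc(values, z_m, n)
    return lower / len(splits), middle / len(splits), upper / len(splits)


rng = random.Random(0)  # arbitrary seed, for illustration only
toy_class = [[rng.random() for _ in range(8)] for _ in range(5)]
results = {n: symmetrization_terms(toy_class, 4, n) for n in (1, 2, 3)}
```

For each n the three returned numbers should appear in non-decreasing order; the first entry does not depend on n because the lower bound always uses the half-split.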


This inequality should be compared to the previously known complexity bounds
of Theorem 1. First of all, in contrast to (1) and (2) the new bound provides
a two-sided control, which shows that PRC is a "correct" complexity measure
for our setting. It is also remarkable that the lower bound (commonly known as
the desymmetrization inequality) does not include any additive terms, since in
the standard i.i.d. setting the lower bound holds only up to an additive term of
order $O(m^{-1/2})$ [13, Sect. 2.1]. Also note that this result does not assume the
boundedness of functions in $\mathcal{F}$, which is a necessary assumption both in (2) and
in the i.i.d. desymmetrization inequality.

Next we compare PRC with the conditional Rademacher complexity: 

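Theorem 3. Let $\mathcal{Z}_m \subset \mathcal{Z}$ be any fixed set of even cardinality m. Then:

$$Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m) \le \left(1 + \frac{2}{\sqrt{2\pi m} - 2}\right) R_m(\mathcal{F}, \mathcal{Z}_m). \qquad (3)$$

Moreover, if the functions in $\mathcal{F}$ are absolutely bounded by B then

$$\left|Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m) - R_m(\mathcal{F}, \mathcal{Z}_m)\right| \le \frac{2B}{\sqrt{m}}. \qquad (4)$$

The results also hold if we include absolute values inside suprema in $Q_{m,n}$ and $R_m$.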

Proof. Conceptually the proof is based on the coupling between a sequence
$\{\epsilon_i\}_{i=1}^{m}$ of i.i.d. Rademacher signs and a uniform random permutation $\{\eta_i\}_{i=1}^{m}$ of
a set containing m/2 plus and m/2 minus signs. This idea was inspired by the
techniques used in [11]. The detailed proof can be found in Sect. 5.2.

Note that a typical order of $R_m(\mathcal{F})$ is $O(m^{-1/2})$, thus the multiplicative
upper bound (3) can be much tighter than the upper bound of (4). We would
also like to note that Theorem 3 significantly improves bounds of Lemma 3
in [2], which relate the so-called maximal discrepancy measure of the class $\mathcal{F}$ to
its Rademacher complexity (for further discussion we refer to the Appendix).
Our next result shows that the bounds of Theorem 3 are essentially tight.

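Lemma 1. Let $\mathcal{Z}_m \subset \mathcal{Z}$ with even m. There are two finite classes $\mathcal{F}'_m$ and $\mathcal{F}''_m$
of functions mapping $\mathcal{Z}$ to $\mathbb{R}$ and absolutely bounded by 1, such that:

$$Q_{m,m/2}(\mathcal{F}'_m, \mathcal{Z}_m) = 0, \qquad (2m)^{-1/2} \le R_m(\mathcal{F}'_m, \mathcal{Z}_m) \le 2 m^{-1/2}; \qquad (5)$$

$$Q_{m,m/2}(\mathcal{F}''_m, \mathcal{Z}_m) = 1, \qquad 1 - \sqrt{\frac{2}{\pi m}} \le R_m(\mathcal{F}''_m, \mathcal{Z}_m) \le 1 - \frac{4}{5}\sqrt{\frac{2}{\pi m}}. \qquad (6)$$

Proof. The proof can be found in Sect. 5.3.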

Inequalities (5) simultaneously show that (a) the order $O(m^{-1/2})$ of the additive
bound (4) cannot be improved, and (b) the multiplicative upper bound (3) cannot
be reversed. Moreover, it can be shown using (6) that the factor appearing
in (3) cannot be improved to $1 + o(m^{-1/2})$.

Finally, we compare PRC to the transductive Rademacher complexity: 

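Lemma 2. Fix any set $\mathcal{Z}_N = \{Z_1, \dots, Z_N\} \subset \mathcal{Z}$. If m = u and N = m + u:

$$R_N(\mathcal{F}, \mathcal{Z}_N) \le R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4) \le 2 R_N(\mathcal{F}, \mathcal{Z}_N).$$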
Proof. The upper bound was presented in [10, Lemma 1]. For the lower bound,
notice that if p = 1/4 the i.i.d. signs $\sigma_i$ presented in Definition 2 have the same
distribution as $\epsilon_i \eta_i$, where $\epsilon_i$ are i.i.d. Rademacher signs and $\eta_i$ are i.i.d. Bernoulli
random variables with parameter 1/2. Thus, Jensen's inequality gives:

$$R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4) = \frac{4}{N} \mathbb{E}_{(\epsilon, \eta)}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m+u} \epsilon_i \eta_i f(Z_i)\right] \ge \frac{4}{N} \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m+u} \epsilon_i \frac{1}{2} f(Z_i)\right] = R_N(\mathcal{F}, \mathcal{Z}_N).$$
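The sandwich between the Rademacher complexity on N points and the TRC at p = 1/4 can be verified exactly for a small finite class by enumerating all sign vectors. This is a sketch only: it assumes the TRC normalisation (1/m + 1/u) · E_σ sup_f Σ_i σ_i f(Z_i) and the Rademacher normalisation 2/N, and the table `toy_class` and both function names are hypothetical.

```python
import itertools


def rademacher(values):
    """Exact R_N = E_eps sup_f (2/N) sum_i eps_i f(Z_i) for a finite class."""
    n_points = len(values[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n_points):
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values)
    return 2.0 * total / (n_points * 2 ** n_points)


def trc(values, m, u, p):
    """Exact TRC, assuming the normalisation (1/m + 1/u) * E_sigma sup_f sum_i sigma_i f(Z_i)."""
    weight = {-1: p, 0: 1.0 - 2.0 * p, 1: p}
    total = 0.0
    for sigma in itertools.product((-1, 0, 1), repeat=m + u):
        prob = 1.0
        for s in sigma:
            prob *= weight[s]
        total += prob * max(sum(s * v for s, v in zip(sigma, row)) for row in values)
    return (1.0 / m + 1.0 / u) * total


# Hypothetical class of three functions evaluated on N = 4 points.
toy_class = [[0.2, 0.9, 0.4, 0.7], [1.0, 0.1, 0.5, 0.3], [0.0, 0.0, 1.0, 1.0]]
r_n = rademacher(toy_class)
trc_quarter = trc(toy_class, 2, 2, 0.25)
trc_half = trc(toy_class, 2, 2, 0.5)
```

With p = 1/2 the variables σ_i reduce to Rademacher signs, so in this equal-size case the TRC equals exactly twice the Rademacher complexity, which is the extreme point of the upper bound.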

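Together with Theorems 2 and 3 this result shows that when m = u the PRC
cannot be much larger than the transductive Rademacher complexity:

Corollary 1. Using notations of Theorem 2 we have:

$$\mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \le \left(2 + \frac{4}{\sqrt{2\pi N} - 2}\right) R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4).$$

If functions in $\mathcal{F}$ are uniformly bounded by B then we also have a lower bound:

$$\mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \ge \frac{1}{2} R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4) - \frac{2B}{\sqrt{N}}.$$

Proof. Simply notice that $\mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_u) - \bar{f}(\mathcal{Z}_m)\right)\right] = Q_{N,m}(\mathcal{F}, \mathcal{Z}_N)$.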


4 Transductive Risk Bounds 

Next we will use the results of Sect. 3 to obtain a new transductive risk bound.
First we will briefly describe the setting.

We will consider the second, distribution-free setting of transductive learning
described in the introduction. Fix any finite general population of input-output
pairs $\mathcal{Z}_N = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are arbitrary input and output
spaces. We make no assumptions regarding the underlying source of $\mathcal{Z}_N$. The learner
receives the labeled training set $\mathcal{Z}_m$ consisting of m < N elements sampled
uniformly without replacement from $\mathcal{Z}_N$. The remaining test set $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$
is presented to the learner without labels (we will use $\mathcal{X}_u$ to denote the inputs of
$\mathcal{Z}_u$). The goal of the learner is to find a predictor in the fixed hypothesis class
$\mathcal{H}$ based on the training sample $\mathcal{Z}_m$ and unlabelled test points $\mathcal{X}_u$, which has
a small test risk measured using a bounded loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. For
$h \in \mathcal{H}$ and $(x, y) \in \mathcal{Z}_N$ denote $\ell_h(x, y) = \ell(h(x), y)$ and also denote the loss
class $\mathcal{L}_{\mathcal{H}} = \{\ell_h : h \in \mathcal{H}\}$. Then the test and training risks of $h \in \mathcal{H}$ are defined
as $\mathrm{err}_u(h) := \bar{\ell}_h(\mathcal{Z}_u)$ and $\mathrm{err}_m(h) := \bar{\ell}_h(\mathcal{Z}_m)$ respectively.

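The following risk bound in terms of TRC was presented in [10, Corollary 2]:

Theorem 4 ([10]). If m = u then with probability at least 1 − δ over the random
training set $\mathcal{Z}_m$ any $h \in \mathcal{H}$ satisfies:

$$\mathrm{err}_u(h) \le \mathrm{err}_m(h) + R^{td}_{m+u}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_N, 1/4) + 11\sqrt{\frac{2}{N}} + \sqrt{\frac{2N \log(1/\delta)}{(N - 1/2)^2}}. \qquad (7)$$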
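Using results of Sect. 3 we obtain the following risk bound:

Theorem 5. If m = u and $n \in \{1, \dots, m - 1\}$ then with probability at least
1 − δ over the random training set $\mathcal{Z}_m$ any $h \in \mathcal{H}$ satisfies:

$$\mathrm{err}_u(h) \le \mathrm{err}_m(h) + \mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,n}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m)\right] + \sqrt{\frac{2N \log(1/\delta)}{(N - 1/2)^2}}. \qquad (8)$$

Moreover, with probability at least 1 − δ any $h \in \mathcal{H}$ satisfies:

$$\mathrm{err}_u(h) \le \mathrm{err}_m(h) + Q_{m,n}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m) + 2\sqrt{\frac{2N \log(2/\delta)}{(N - 1/2)^2}}. \qquad (9)$$

Proof. The proof can be found in Sect. 5.4.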

We conclude by comparing the risk bounds of Theorems 5 and 4:

1. First of all, the second upper bound of Theorem 5 is computable. This bound is based
on the concentration argument, which shows that the expected PRC (appearing
in the first bound of Theorem 5) can be nicely estimated using the training set.
Meanwhile, the upper bound of Theorem 4 depends on the unknown labels of the
test set through TRC. In order to make it computable the authors of [10] resorted
to the contraction inequality, which allows one to drop any dependence on the labels
for Lipschitz losses and is known to be a loose step [17].

2. Moreover, we would like to note that for the binary loss function TRC (as well
as the Rademacher complexity) does not depend on the labels at all. Indeed,
this can be shown by writing $\ell_{01}(y, y') = (1 - y y')/2$ for $y, y' \in \{-1, +1\}$ and
noting that $\sigma_i$ and $\sigma_i y$ are identically distributed for the $\sigma_i$ used in Definition 2.
This is not true for PRC, which is sensitive to the labels even in this setting. As
future work we hope to use this fact for analysis in the low noise setting [4].

3. The slack term appearing in Theorem 5 is significantly smaller than the one of Theorem 4.
For instance, if δ = 0.01 then the latter is 13 times larger. This is caused by the
additive term in the symmetrization inequality (2). At the same time, Corollary 1
shows that the complexity term appearing in the first bound of Theorem 5 is at most
two times larger than TRC, appearing in Theorem 4.

4. The comparison result of Theorem 3 shows that the upper bound of Theorem 5 is also
tighter than the one which can be obtained using (1) and conditional Rademacher
complexity.

5. Similar upper bounds (up to an extra factor of 2) also hold for the excess risk
$\mathrm{err}_u(h_m) - \inf_{h \in \mathcal{H}} \mathrm{err}_u(h)$, where $h_m$ minimizes the training risk $\mathrm{err}_m$ over $\mathcal{H}$.
This can be proved using an argument similar to that of Theorem 5.

6. Finally, one more application of the concentration argument can simplify
the computation of PRC, by estimating the expected value appearing in
Definition 3 with only one random partition of $\mathcal{Z}_m$.
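The single-partition estimate mentioned in the last item can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes a finite hypothesis class whose training losses are stored as rows (`losses[j][i]` for hypothesis j on training point i), and the name `prc_single_partition`, the seed and the toy loss table are all hypothetical.

```python
import random


def prc_single_partition(values, n, rng):
    """One-draw estimate of Q_{m,n}: sup_f (mean_f(Z_k) - mean_f(Z_n)) for one random Z_n."""
    m = len(values[0])
    # Draw Z_n uniformly without replacement; Z_k is its complement.
    z_n = set(rng.sample(range(m), n))
    z_k = [i for i in range(m) if i not in z_n]
    return max(
        sum(row[i] for i in z_k) / len(z_k) - sum(row[i] for i in z_n) / n
        for row in values
    )


rng = random.Random(0)  # arbitrary seed, for illustration only
losses = [[rng.choice((0.0, 1.0)) for _ in range(20)] for _ in range(10)]
estimate = prc_single_partition(losses, 10, rng)
```

Only one supremum over the class is evaluated here, compared with one per partition in the exact definition; how much accuracy is lost is governed by the concentration argument referred to in the text.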
5 Full Proofs


5.1 Proof of Theorem 2

Lemma 3. For 0 < m < N let $\mathcal{S}_m := \{s_1, \dots, s_m\}$ be sampled uniformly
without replacement from a finite set of real numbers $\mathcal{C} = \{c_1, \dots, c_N\} \subset \mathbb{R}$. Then:

$$\mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m} s_i\right] = \frac{1}{N} \sum_{j=1}^{N} c_j.$$

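Proof (of Theorem 2). Fix any positive integers n and k such that n + k = m,
which implies n < m and k < m = u. Note that Lemma 3 implies:

$$\bar{f}(\mathcal{Z}_u) = \mathbb{E}_{\mathcal{S}_k}\left[\bar{f}(\mathcal{S}_k)\right], \qquad \bar{f}(\mathcal{Z}_m) = \mathbb{E}_{\mathcal{S}_n}\left[\bar{f}(\mathcal{S}_n)\right],$$

where $\mathcal{S}_k$ and $\mathcal{S}_n$ are sampled uniformly without replacement from $\mathcal{Z}_u$ and $\mathcal{Z}_m$
respectively. Using Jensen's inequality we get:

$$\mathbb{E}_{\mathcal{Z}_m} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_u) - \bar{f}(\mathcal{Z}_m)\right) = \mathbb{E}_{\mathcal{Z}_m} \sup_{f \in \mathcal{F}} \left(\mathbb{E}_{\mathcal{S}_k}\left[\bar{f}(\mathcal{S}_k)\right] - \mathbb{E}_{\mathcal{S}_n}\left[\bar{f}(\mathcal{S}_n)\right]\right) \le \mathbb{E}_{(\mathcal{Z}_m, \mathcal{S}_k, \mathcal{S}_n)} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{S}_k) - \bar{f}(\mathcal{S}_n)\right). \qquad (10)$$

The marginal distribution of $(\mathcal{S}_k, \mathcal{S}_n)$, appearing in (10), can be equivalently
described by first sampling $\mathcal{Z}_m$ from $\mathcal{Z}_N$, then $\mathcal{S}_n$ from $\mathcal{Z}_m$ (both times uniformly
without replacement), and setting $\mathcal{S}_k := \mathcal{Z}_m \setminus \mathcal{S}_n$ (recall that n + k = m). Thus

$$\mathbb{E}_{(\mathcal{Z}_m, \mathcal{S}_k, \mathcal{S}_n)} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{S}_k) - \bar{f}(\mathcal{S}_n)\right) = \mathbb{E}_{\mathcal{Z}_m}\left[\mathbb{E}_{\mathcal{S}_n}\left[\sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_m \setminus \mathcal{S}_n) - \bar{f}(\mathcal{S}_n)\right)\right]\right] = \mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,n}(\mathcal{F}, \mathcal{Z}_m)\right],$$

which completes the proof of the upper bound.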

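We have shown that for $n \in \{1, \dots, m - 1\}$ and k := m − n:

$$\mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,n}(\mathcal{F}, \mathcal{Z}_m)\right] = \mathbb{E}_{(\mathcal{Z}_k, \mathcal{Z}_n)} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_k) - \bar{f}(\mathcal{Z}_n)\right), \qquad (11)$$

where $\mathcal{Z}_n$ and $\mathcal{Z}_k$ are sampled uniformly without replacement from $\mathcal{Z}_N$ and
$\mathcal{Z}_N \setminus \mathcal{Z}_n$ respectively. Let $\mathcal{Z}_{m-n}$ be sampled uniformly without replacement
from $\mathcal{Z}_N \setminus (\mathcal{Z}_n \cup \mathcal{Z}_k)$ and let $\mathcal{Z}_{u-k}$ be the remaining u − k elements of $\mathcal{Z}_N$. Using
Lemma 3 once again we get:

$$\mathbb{E}\left[\bar{f}(\mathcal{Z}_{m-n}) \mid (\mathcal{Z}_n, \mathcal{Z}_k)\right] = \mathbb{E}\left[\bar{f}(\mathcal{Z}_{u-k}) \mid (\mathcal{Z}_n, \mathcal{Z}_k)\right].$$

We can rewrite the r.h.s. of (11) as:

$$\mathbb{E}_{(\mathcal{Z}_n, \mathcal{Z}_k)} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_k) - \bar{f}(\mathcal{Z}_n) + \mathbb{E}\left[\bar{f}(\mathcal{Z}_{u-k}) - \bar{f}(\mathcal{Z}_{m-n}) \mid (\mathcal{Z}_n, \mathcal{Z}_k)\right]\right) \le \mathbb{E} \sup_{f \in \mathcal{F}} \left(\bar{f}(\mathcal{Z}_k) - \bar{f}(\mathcal{Z}_n) + \bar{f}(\mathcal{Z}_{u-k}) - \bar{f}(\mathcal{Z}_{m-n})\right),$$

where we have used Jensen's inequality. If we take $n^* = k^* = m/2$ we get

$$\mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \le \mathbb{E} \sup_{f \in \mathcal{F}} \left(2\bar{f}(\mathcal{Z}_{k^*} \cup \mathcal{Z}_{u-k^*}) - 2\bar{f}(\mathcal{Z}_{n^*} \cup \mathcal{Z}_{m-n^*})\right).$$

It is left to notice that the random subsets $\mathcal{Z}_{k^*} \cup \mathcal{Z}_{u-k^*}$ and $\mathcal{Z}_{n^*} \cup \mathcal{Z}_{m-n^*}$ have
the same distributions as $\mathcal{Z}_u$ and $\mathcal{Z}_m$.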


5.2 Proof of Theorem 3

Let m = 2 · n, let $\epsilon = \{\epsilon_i\}_{i=1}^{m}$ be i.i.d. Rademacher signs, and let $\eta = \{\eta_i\}_{i=1}^{m}$ be a
uniform random permutation of a set containing n plus and n minus signs. The
proof of Theorem 3 is based on the coupling of random variables $\epsilon$ and $\eta$, which
is described in Lemma 4. We will need a number of definitions. Consider the binary
cube $\mathcal{B}_m := \{-1, +1\}^m$. Denote $\mathcal{S}_m := \{v \in \mathcal{B}_m : \sum_{i=1}^{m} v_i = 0\}$, which is the set
of all the vectors in $\mathcal{B}_m$ having an equal number of plus and minus signs. For any
$v \in \mathcal{B}_m$ denote $\|v\|_1 = \sum_{i=1}^{m} |v_i|$ and consider the following set:

$$T(v) = \arg\min_{v' \in \mathcal{S}_m} \|v - v'\|_1,$$

which consists of the points in $\mathcal{S}_m$ closest to v in Hamming metric. For any
$v \in \mathcal{B}_m$ let t(v) be a random element of T(v), distributed uniformly. We will use
$t_i(v)$ to denote the i-th coordinate of the vector t(v).

Remark 1. If $v \in \mathcal{S}_m$ then T(v) = {v}. Otherwise, T(v) will clearly contain
more than one element of $\mathcal{S}_m$. Namely, it can be shown that if for some positive
integer q it holds that $\sum_{i=1}^{m} v_i = q$, then q is necessarily even and T(v) consists
of all the vectors in $\mathcal{S}_m$ which can be obtained by replacing q/2 of the +1 signs in v
with −1 signs, and thus in this case $\mathrm{card}(T(v)) = \binom{(m+q)/2}{q/2}$.

Lemma 4 (Coupling). Assume that m = 2 · n. Then the random sequence $t(\epsilon)$
has the same distribution as $\eta$.

Proof. Note that the support of $t(\epsilon)$ is equal to $\mathcal{S}_m$. From symmetry it is easy
to conclude that the distribution of $t(\epsilon)$ is exchangeable. This means that it is
invariant under permutations and as a consequence uniform on $\mathcal{S}_m$.
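For small m the coupling can be checked exactly: enumerate every sign vector, project it uniformly onto its nearest balanced vectors, and accumulate the resulting law. The sketch below is illustrative only; the helper names `nearest_balanced` and `coupling_law` are hypothetical, and exact rational arithmetic is used so that no tolerance is needed. It also records the conditional mean of each coordinate of the original signs given the projected vector, which is the quantity studied in the lemma that follows.

```python
import itertools
from fractions import Fraction
from math import comb


def nearest_balanced(v):
    """Return T(v): the balanced sign vectors closest to v in Hamming distance."""
    m = len(v)
    balanced = [s for s in itertools.product((-1, 1), repeat=m) if sum(s) == 0]
    dist = lambda s: sum(a != b for a, b in zip(v, s))
    d_min = min(dist(s) for s in balanced)
    return [s for s in balanced if dist(s) == d_min]


def coupling_law(m):
    """Exact law of t(eps) and of E[eps | t(eps)] by full enumeration."""
    law = {}
    cond_sum = {}
    p_eps = Fraction(1, 2 ** m)
    for eps in itertools.product((-1, 1), repeat=m):
        targets = nearest_balanced(eps)
        w = p_eps / len(targets)  # t(eps) is uniform on T(eps)
        for t in targets:
            law[t] = law.get(t, 0) + w
            acc = cond_sum.setdefault(t, [Fraction(0)] * m)
            for i in range(m):
                acc[i] += w * eps[i]
    cond_mean = {t: [a / law[t] for a in acc] for t, acc in cond_sum.items()}
    return law, cond_mean


law, cond_mean = coupling_law(4)
shrink = 1 - Fraction(comb(4, 2), 2 ** 4)  # the factor 1 - 2^{-m} C(m, m/2)
```

For m = 4 the law puts mass 1/6 on each of the six balanced vectors, and each conditional mean is the corresponding balanced sign shrunk by the factor 5/8.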

The next result is at the core of the multiplicative upper bound (3).

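Lemma 5. Assume that m = 2 · n. For any $q \in \{1, \dots, m\}$ the following holds:

$$\mathbb{E}\left[\epsilon_q \mid t(\epsilon)\right] = \left(1 - 2^{-m}\binom{m}{n}\right) t_q(\epsilon) \ge \left(1 - 2(2\pi m)^{-1/2}\right) t_q(\epsilon).$$

Proof. We will first upper bound $\mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \mid t(\epsilon) = e\}$, where $e = \{e_i\}_{i=1}^{m}$ is
(w.l.o.g.) a sequence of n plus signs followed by a sequence of n minus signs.

$$\mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \mid t(\epsilon) = e\} = \frac{\mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \cap t(\epsilon) = e\}}{\mathbb{P}\{t(\epsilon) = e\}} = \binom{m}{n} 2^{-m} \sum_{s} \mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\}, \qquad (12)$$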
where we have used Lemma 4 and the sum is over all different sequences of m
signs $s = \{s_i\}_{i=1}^{m}$. For any s denote $S(s) = \sum_{j=1}^{m} s_j$ and consider terms in (12)
corresponding to s with S(s) = 0, S(s) > 0, and S(s) < 0:

Case 1: S(s) = 0. These terms will be zero, since t(s) = s.

Case 2: S(s) > 0. This means that s "has more plus signs than it should" and
according to Remark 1 the mapping t(·) will replace several of "+1" with "−1". In
particular, if $s_q = -1$ then $t_q(s) = s_q$ and thus the corresponding terms will be
zero. If $s_q = 1$ and at the same time $e_q = 1$ the event $\{\epsilon_q \neq t_q(\epsilon) \cap t(\epsilon) = e\}$ also
cannot hold. Moreover, note that the identity e = t(s) can hold only if $e \in T(s)$,
which necessarily leads to

$$\{j \in \{1, \dots, m\} : s_j = -1\} \subseteq \{j \in \{1, \dots, m\} : e_j = -1\}. \qquad (13)$$

From this we conclude that if $q \in \{1, \dots, n\}$ then all the terms corresponding
to s with S(s) > 0 are zero. We will use $U_q(e)$ to denote the subset of $\mathcal{B}_m$
consisting of sequences s, such that (a) S(s) > 0, (b) $s_q = 1$, and (c) condition
(13) holds. It can be seen that if $s \in U_q(e)$ then:

$$\mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\} = \binom{(m + S(s))/2}{S(s)/2}^{-1}.$$

This holds since, according to Remark 1, $t(\epsilon)$ can take exactly $\binom{(m + S(s))/2}{S(s)/2}$ different
values, while only one of them is equal to e.

Let us compute the cardinality of $U_q(e)$ for $q \in \{n + 1, \dots, m\}$. It is easy to
check that the condition S(s) = 2j for some positive integer j implies that s has
exactly n − j minus signs. Considering the fact that $s_q = 1$ for $s \in U_q(e)$ we
have:

$$\mathrm{card}\left(\{s \in U_q(e) : S(s) = 2j\}\right) = \binom{n - 1}{n - j}.$$

Combining everything together we have:

$$\sum_{s :\, S(s) > 0} \mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\} = \mathbb{1}\{q > n\} \sum_{j=1}^{n} \frac{\binom{n-1}{n-j}}{\binom{n+j}{j}}.$$

Finally, it is easy to show using induction that:

$$\sum_{j=1}^{n} \frac{\binom{n-1}{n-j}}{\binom{n+j}{j}} = \frac{1}{2}.$$

Case 3: S(s) < 0. We can repeat all the steps of the previous case and get:

$$\sum_{s :\, S(s) < 0} \mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\} = \frac{1}{2} \mathbb{1}\{q \le n\}.$$

Accounting for these three cases in (12) we conclude that

$$\mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \mid t(\epsilon) = e\} = \frac{1}{2}\binom{m}{n} 2^{-m} \le \frac{1}{\sqrt{2\pi m}},$$
where we have used the upper bound on the binomial coefficient from [19,
Corollary 2.4]. We can conclude the proof of the lemma by writing:

$$\mathbb{E}\left[\epsilon_q \mid t(\epsilon)\right] = t_q(\epsilon)\left(1 - 2\,\mathbb{P}\{\epsilon_q \neq t_q(\epsilon) \mid t(\epsilon)\}\right) \ge t_q(\epsilon)\left(1 - 2(2\pi m)^{-1/2}\right).$$

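Proof (of Theorem 3). First we prove (3). Let $\mathcal{Z}_m = \{z_1, \dots, z_m\}$. We can write:

$$Q_{m,n}(\mathcal{F}) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} t_i(\epsilon) f(z_i)\right] \qquad (14)$$

$$\le \left(1 - 2(2\pi m)^{-1/2}\right)^{-1} \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \mathbb{E}\left[\epsilon_i \mid t(\epsilon)\right] f(z_i)\right] \qquad (15)$$

$$\le \left(1 + \frac{2}{\sqrt{2\pi m} - 2}\right) \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right], \qquad (16)$$

where we have used the coupling Lemma 4 in (14), Lemma 5 in (15), and Jensen's
inequality in (16). This completes the proof of (3).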

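Next we prove (4). We have:

$$\left|Q_{m,n}(\mathcal{F}) - R_m(\mathcal{F})\right| = \left|\mathbb{E}_{\eta}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \eta_i f(z_i)\right] - \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right]\right|.$$

Using Lemma 4 and Jensen's inequality we further get:

$$\left|Q_{m,n}(\mathcal{F}) - R_m(\mathcal{F})\right| = \left|\mathbb{E}_{\epsilon}\left[\mathbb{E}_{t}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) \,\Big|\, \epsilon\right]\right] - \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right]\right|$$

$$\le \mathbb{E}_{\epsilon}\left[\mathbb{E}_{t}\left[\left|\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) - \sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right| \,\Big|\, \epsilon\right]\right], \qquad (17)$$

where we have, perhaps misleadingly, denoted the conditional expectation with
respect to the uniform choice from $T(\epsilon)$ given $\epsilon$ using $\mathbb{E}_t[\,\cdot \mid \epsilon]$. Next we have:

$$\left|\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) - \sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right| \le \sup_{f \in \mathcal{F}} \left|\frac{4}{m} \sum_{i \in S(\epsilon, t)} \epsilon_i f(z_i)\right|, \qquad (18)$$

where $S(\epsilon, t) \subseteq \{1, \dots, m\}$ is a subset of indices, s.t. $t_i(\epsilon) \neq \epsilon_i$ iff $i \in S(\epsilon, t)$.
We can continue by writing

$$\left|\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) - \sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right| \le \frac{4}{m} \sup_{f \in \mathcal{F}} \sum_{i \in S(\epsilon, t)} |f(z_i)|. \qquad (19)$$

Note that since functions in $\mathcal{F}$ are absolutely bounded by B:

$$\sup_{f \in \mathcal{F}} \sum_{i \in S(\epsilon, t)} |f(z_i)| \le B \cdot \mathrm{card}\left(S(\epsilon, t)\right).$$

Returning to (17) and using Remark 1, which gives $\mathrm{card}(S(\epsilon, t)) = \frac{1}{2}\left|\sum_{i=1}^{m} \epsilon_i\right|$, we obtain:

$$\left|Q_{m,n}(\mathcal{F}) - R_m(\mathcal{F})\right| \le \frac{4B}{m}\, \mathbb{E}_{\epsilon}\left[\mathbb{E}_{t}\left[\mathrm{card}\left(S(\epsilon, t)\right) \mid \epsilon\right]\right] = \frac{2B}{m}\, \mathbb{E}_{\epsilon}\left[\left|\sum_{i=1}^{m} \epsilon_i\right|\right].$$

Khinchin's inequality [15, Lemma 4.1] together with the best known constant
due to [12] gives $\mathbb{E}_{\epsilon}\left[\left|\sum_{i=1}^{m} \epsilon_i\right|\right] \le \sqrt{m}$, which completes the proof of (4).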


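5.3 Proof of Lemma 1

Proof. Let $\mathcal{Z}_m = \{z_1, \dots, z_m\}$. Take $\mathcal{F}'_m$ to be a set of two constant functions,
$f_1(z) = 1$ and $f_2(z) = 0$ for all $z \in \mathcal{Z}$. Clearly, $Q_{m,n}(\mathcal{F}'_m) = 0$. At the same time:

$$R_m(\mathcal{F}'_m) = \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}'_m} \frac{2}{m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right] = \mathbb{E}_{\epsilon}\left[\max\left\{0, \frac{2}{m} \sum_{i=1}^{m} \epsilon_i\right\}\right] \le \frac{2}{m}\, \mathbb{E}_{\epsilon}\left[\left|\sum_{i=1}^{m} \epsilon_i\right|\right] \le 2 m^{-1/2},$$

where we used Khinchin's inequality. Finally, Khinchin's inequality also gives:

$$R_m(\mathcal{F}'_m) = \frac{1}{m}\, \mathbb{E}_{\epsilon}\left[\left|\sum_{i=1}^{m} \epsilon_i\right|\right] \ge (2m)^{-1/2}.$$

Next, let $\mathcal{F}''_m$ contain $\binom{m}{m/2}$ functions, such that their projections on $\mathcal{Z}_m$ recover
all the permutations of a binary vector containing an equal number of 0 and 1. Clearly,
in this case $Q_{m,n}(\mathcal{F}''_m) = 1$. Straightforward calculations show that at the same
time $R_m(\mathcal{F}''_m) = 1 - 2^{-m}\binom{m}{m/2}$, and we conclude the proof using upper and lower
bounds on the binomial coefficient from [19, Corollary 2.4].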

5.4 Proof of Theorem 5

The following version of McDiarmid's bounded difference inequality for the
setting of sampling without replacement was presented in [10, Lemma 2] and further
improved in [8, Theorem 5]:

Theorem 6 ([10], [8]). Let $\mathcal{Z}_m$ be sampled uniformly without replacement from
a fixed set $\mathcal{Z}_{m+u} \subset \mathcal{Z}$ of m + u elements. Let $g: \mathcal{Z}^m \to \mathbb{R}$ be a symmetric function
s.t. for all i = 1, ..., m and for all $z_1, \dots, z_m \in \mathcal{Z}$ and $z'_1, \dots, z'_m \in \mathcal{Z}$,

$$\left|g(z_1, \dots, z_m) - g(z_1, \dots, z_{i-1}, z'_i, z_{i+1}, \dots, z_m)\right| \le c. \qquad (20)$$

Then if m = u with probability not less than 1 − δ the following holds:

$$g \le \mathbb{E}[g] + \sqrt{\frac{c^2 N^3 \log(1/\delta)}{8 (N - 1/2)^2}}.$$
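Note that the function $\sup_{h \in \mathcal{H}} \left(\mathrm{err}_u(h) - \mathrm{err}_m(h)\right)$ maps $(\mathcal{X} \times \mathcal{Y})^m$ to $\mathbb{R}$ and is of
course symmetric. Straightforward calculations show that this function satisfies the
bounded difference condition (20) with $c = \frac{1}{m} + \frac{1}{u}$ ([10, Inequality 9]). Theorem 6
states that with probability not less than 1 − δ:

$$\sup_{h \in \mathcal{H}} \left(\mathrm{err}_u(h) - \mathrm{err}_m(h)\right) \le \mathbb{E}_{\mathcal{Z}_m}\left[\sup_{h \in \mathcal{H}} \left(\mathrm{err}_u(h) - \mathrm{err}_m(h)\right)\right] + \sqrt{\frac{2N \log(1/\delta)}{(N - 1/2)^2}}. \qquad (21)$$

Using the upper bound of Theorem 2 with $\mathcal{L}_{\mathcal{H}}$ in place of $\mathcal{F}$ we complete the proof
of the first inequality of the theorem. Next, consider the symmetric function $-Q_{m,n}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m)$, which also maps
$(\mathcal{X} \times \mathcal{Y})^m$ to $\mathbb{R}$. It can be shown again that it satisfies the bounded difference
condition with $c = 2/m = 4/N$. Thus, Theorem 6 gives that with probability not less
than 1 − δ:

$$\mathbb{E}_{\mathcal{Z}_m}\left[Q_{m,n}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m)\right] \le Q_{m,n}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m) + \sqrt{\frac{2N \log(1/\delta)}{(N - 1/2)^2}}. \qquad (22)$$

Using this inequality together with (21) in a union bound we obtain the second
inequality of the theorem.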
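The slack term used in this proof comes from the general expression sqrt(c² N³ log(1/δ) / (8 (N − 1/2)²)) with the bounded-difference constant c = 1/m + 1/u = 4/N for m = u. The short sketch below checks that substitution numerically; the function name and the values of N and δ are assumptions chosen only for illustration.

```python
from math import log, sqrt


def mcdiarmid_slack(c, n_total, delta):
    """sqrt(c^2 N^3 log(1/delta) / (8 (N - 1/2)^2)), the deviation term for m = u = N/2."""
    return sqrt(c ** 2 * n_total ** 3 * log(1.0 / delta) / (8.0 * (n_total - 0.5) ** 2))


n_total, delta = 1000, 0.01  # illustrative values, not from the paper
# With m = u the bounded-difference constant is c = 1/m + 1/u = 4/N.
slack = mcdiarmid_slack(4.0 / n_total, n_total, delta)
closed_form = sqrt(2.0 * n_total * log(1.0 / delta)) / (n_total - 0.5)
```

Both expressions agree, and the slack decays at the rate N^{-1/2} up to the logarithmic dependence on δ.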


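Appendix: Improving Lemma 3 of [2]

Let $\mu$ be a probability distribution on $\mathcal{Z}$ and $\mathcal{X}_m := \{X_1, \dots, X_m\}$ be i.i.d.
samples selected according to $\mu$. Maximal discrepancy of $\mathcal{F}$ was defined in [2] as:

$$D_m(\mathcal{F}, \mathcal{X}_m) = \sup_{f \in \mathcal{F}} \left(\frac{2}{m} \sum_{i=1}^{m/2} f(X_i) - \frac{2}{m} \sum_{i=m/2+1}^{m} f(X_i)\right).$$

It was shown in [2] that if functions in $\mathcal{F}$ are uniformly bounded by 1 then:

$$\frac{1}{2}\, \mathbb{E}\left[R_m(\mathcal{F}, \mathcal{X}_m)\right] - 2\sqrt{\frac{2}{m}} \le \mathbb{E}\left[D_m(\mathcal{F}, \mathcal{X}_m)\right] \le \mathbb{E}\left[R_m(\mathcal{F}, \mathcal{X}_m)\right] + 4\sqrt{\frac{2}{m}}. \qquad (23)$$

Since elements in $\mathcal{X}_m$ are i.i.d. the distribution of $D_m$ is invariant under their
permutations and thus $\mathbb{E}\left[D_m(\mathcal{F}, \mathcal{X}_m)\right] = \mathbb{E}\left[Q_{m,m/2}(\mathcal{F}, \mathcal{X}_m)\right]$. Now we can use
Theorem 3 to significantly improve the bounds in (23):

$$\mathbb{E}\left[R_m(\mathcal{F}, \mathcal{X}_m)\right] - \frac{2}{\sqrt{m}} \le \mathbb{E}\left[D_m(\mathcal{F}, \mathcal{X}_m)\right] \le \left(1 + \frac{2}{\sqrt{2\pi m} - 2}\right) \mathbb{E}\left[R_m(\mathcal{F}, \mathcal{X}_m)\right].$$
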

Acknowledgments 

The authors are thankful to Marius Kloft and Ruth Urner for useful discussions 
and to the anonymous reviewers for their comments. GB acknowledges support of 
the DFG through the FOR-1735 grant. NZ was supported solely by the Russian 
Science Foundation grant (project 14-50-00150). 




























Permutational Rademacher Complexity 




References 

1. Bartlett, P., Bousquet, O., Mendelson, S.: Local Rademacher complexities. The
Annals of Statistics, 33(4), 1497-1537 (2005)

2. Bartlett, P., Mendelson, S.: Rademacher and Gaussian complexities: Risk bounds 
and structural results. Journal of Machine Learning Research, 3, 463-482 (2001) 

3. Blum, A., Langford, J.: PAC-MDL Bounds. In: COLT 2003, pp. 344-357 (2003) 

4. Boucheron, S., Lugosi, G., Bousquet, O.: Theory of classification: a survey of recent 
advances. ESAIM: Probability and Statistics, 9, 323-375 (2005) 

5. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A
Nonasymptotic Theory of Independence. Oxford University Press (2013)

6. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press (2006)

7. Cortes, C., Mohri, M.: On transductive regression. In: NIPS 2006, 305-312 (2007) 

8. Cortes, C., Mohri, M., Pechyony, D., Rastogi, A.: Stability analysis and learning 
bounds for transductive regression algorithms. CoRR abs/0904.0814 (2009) 

9. Derbeko, P., El-Yaniv, R., Meir, R.: Explicit learning curves for transduction and
application to clustering and compression algorithms. Journal of Artificial
Intelligence Research, 22(1), 117-142 (2004)

10. El-Yaniv, R., Pechyony, D.: Transductive Rademacher complexity and its
applications. Journal of Artificial Intelligence Research, 35(1), 193-234 (2009)

11. Gross, D., Nesme, V.: Note on sampling without replacing from a finite collection
of matrices. http://arxiv.org/abs/1001.2738v2 (2010)

12. Haagerup, U.: The best constants in Khinchine inequality. Studia Mathematica, 
70(3), 231-283 (1981) 

13. Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse
recovery problems. Springer (2011)

14. Koltchinskii, V., Panchenko, D.: Rademacher processes and bounding the risk of 
function learning. In: Gine. D.E., Wellner, J. (eds.) High Dimensional Probability, 
II, pp. 443-457. Birkhauser (1999) 

15. Ledoux, M., Talagrand, M.: Probability in Banach Space. Springer-Verlag (1991) 

16. Magdon-Ismail, M.: Permutation complexity bound on out-sample error. In:
Advances in Neural Information Processing Systems (NIPS 2010), pp. 1531-1539 (2010)

17. Mendelson, S.: Learning without Concentration. CoRR abs/1401.0304 (2014) 

18. Pechyony, D.: Theory and Practice of Transductive Learning. PhD thesis (2008) 

19. Stanica, P.: Good lower and upper bounds on binomial coefficients. Journal of 
Inequalities in Pure and Applied Mathematics, 2(3) (2001) 

20. Tolstikhin, I., Blanchard, G., Kloft, M.: Localized complexities for transductive 
learning. In: COLT 2014, pp. 857-884 (2014) 

21. Van der Vaart, A. W., Wellner, J.: Weak Convergence and Empirical Processes: 
With Applications to Statistics. Springer (2000) 

22. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons (1998) 


